ci: run pre-publish benchmark gate on every PR by carlos-alm · Pull Request #1072 · optave/ops-codegraph-tool

carlos-alm · 2026-05-06T05:08:46Z

Summary

Ports the pre-publish-benchmark job from publish.yml into ci.yml so the regression gate runs on every PR instead of only at release time.
Reuses the existing native-host-build artifact (linux-x64), runs in parallel with the rest of CI, and is wired into ci-pipeline needs so a regression fails the PR.
Preserves CODEGRAPH_FAST_SKIP_DIAG=1 and the verbose vitest reporters so when something regresses, the logs already pinpoint the cause without re-running locally.

Why

Until now, the regression gate only ran inside publish.yml's pre-publish path (stable releases / workflow_dispatch). Regressions could land on main and were only caught at the next release. The recent diagnostics work (#1067, #1066 follow-ups) makes the gate's logs genuinely useful — moving it onto every PR puts that signal in front of the change that introduced the regression, while it's still cheap to fix.

How it differs from publish.yml's gate

Version label: --version dev instead of a semver. update-benchmark-report.ts treats dev as a rolling entry (replaces previous dev); the regression-guard's findLatestPair compares dev against the most recent non-skipped release.
Artifact source: downloads native-host-ubuntu-latest from CI's existing matrix instead of native-linux-x64 from publish's matrix — no new Rust build added.
No history-file artifact upload: that's only needed by the post-publish record-benchmarks job in benchmark.yml.
JSON results uploaded with if: always(): failed gates are debuggable without re-running the suite locally.

Test plan

CI green on this PR — confirms the new job builds, downloads the host artifact, runs all benchmark scripts (build/resolution/query/incremental), passes the resolution thresholds, tracer validation, and regression guard.
Verify the new job runs in parallel with lint/typecheck/test (only depends on native-host-build).
On a follow-up PR that intentionally regresses a metric (or just look at logs), confirm the gate fails with the diagnostic context (CODEGRAPH_FAST_SKIP_DIAG=1, verbose reporters, JSON artifact attached).

Mirrors the regression gate from publish.yml so PRs catch performance regressions before merge instead of at release time. The gate measures the PR-built native artifact as version "dev", appends to the benchmark history files in-job, and fails if any metric regresses beyond threshold vs the most recent release baseline. Reuses the existing native-host-build artifact (linux-x64), runs in parallel with the rest of CI, and preserves CODEGRAPH_FAST_SKIP_DIAG so fast-skip rejection reasons surface in CI logs when issues happen. JSON results upload on always() for post-failure debugging.

claude · 2026-05-06T05:08:58Z

Claude finished @carlos-alm's task —— View job

Reviewing PR: ci: run pre-publish benchmark gate on every PR

Todo List:

Read and analyze the changed workflow file
Phase 1: Root cause analysis
Phase 2: Configuration change detection
Phase 3: Test integrity assessment
Phase 4: Code quality review
Phase 5: Backlog compliance evaluation
Provide final assessment and rating

Root Cause Analysis

Problem Being Solved: Performance regressions were only being caught at release time via the pre-publish benchmark gate in publish.yml, allowing regressions to land on main and only surface during the next release cycle when they're expensive to fix.

Why This Approach: Moving the gate to run on every PR means regressions are caught immediately when introduced, while the diagnostic context is still fresh and the fix is cheap. The recent diagnostics improvements (#1067, #1066) make the gate logs genuinely useful for debugging.

Risk Assessment: This is a defensive change that adds quality gates without removing existing safeguards. The approach mirrors the existing proven gate from publish.yml with minimal modifications.

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies added, builds on existing benchmark infrastructure
Foundation-aligned: ✓ — Aligns with P1 (always-current) by preventing regressions from reaching main, supports P5 (embeddable-first) by ensuring API stability
Problem-fit: 4 — Directly improves development workflow by catching performance regressions before they affect users/agents
Breaking: No — Purely additive CI enhancement, no changes to user-facing APIs or behavior
Tier: 1

Critical Concerns

None identified. This is a well-executed defensive improvement.

Strengths:

Smart artifact reuse (lines 339-343): Reuses native-host-ubuntu-latest from existing CI matrix instead of building a separate artifact, avoiding duplication
Proper parallelization (line 310, 444): Depends only on native-host-build, runs parallel with other checks, doesn't extend critical path
Rolling version strategy (line 374): Uses --version dev as a rolling entry that replaces previous dev runs, compared against latest stable baseline
Debuggable failures (lines 431-440): if: always() ensures JSON artifacts are uploaded even on failure, making regressions debuggable without local reproduction
Preserved diagnostics (line 316): Maintains CODEGRAPH_FAST_SKIP_DIAG=1 for the incremental rebuild regression investigation
Identical test coverage: All four benchmark suites (build/resolution/query/incremental) plus threshold gates and tracer validation preserved from publish.yml

Technical Implementation Quality:

Clean duplication strategy: Rather than complex abstraction, cleanly duplicates the proven logic with targeted modifications (version label, artifact source)
Consistent with existing patterns: Uses same timeout values, shell scripting patterns, and error handling as other CI jobs
Proper gate integration: Correctly wired into ci-pipeline needs array (line 444) so regressions fail the PR

Minor Considerations:

Resource usage: This adds ~5-10 minutes of compute time per PR, but given the value of catching regressions early, this is well-justified overhead
No history upload: Correctly omits the history file artifact upload since that's only needed for the post-publish recording job

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: This is a high-quality defensive improvement that addresses a real pain point (late regression detection) with a proven, well-implemented solution. The approach is technically sound, preserves all safeguards, and follows existing patterns. The only reason it's not 5 stars is that it's "just" moving existing functionality rather than innovative new capability—but that's actually a strength for this type of quality gate.

greptile-apps · 2026-05-06T05:11:14Z

Greptile Summary

This PR ports the pre-publish-benchmark job from publish.yml into ci.yml, so the regression gate runs on every PR rather than only at release time. It also fixes two pre-existing silent failures in regression-guard.test.ts: the dev build was being excluded from the import-resolution sub-benchmark, and the version-gap check prevented dev from ever being selected as the latest candidate in findLatestPair.

CI job added: pre-publish-benchmark runs in parallel with lint/test/typecheck on PRs, downloads the native-host-ubuntu-latest artifact, tags results as --version dev, and gates the PR via ci-pipeline's needs array; the job is skipped on push-to-main to avoid duplicating minutes.
findLatestPair fix: adds an isDevLatest flag that bypasses the effectiveGap check so dev is always compared against the most recent non-skipped release, not silently dropped.
Import-resolution fix: replaces the hard e.version !== 'dev' filter with findLatestPair, and propagates baselineVersion to assertNoRegressions (and the precision/recall loops) so KNOWN_REGRESSIONS entries are correctly matched even when latest.version === 'dev'.

Confidence Score: 5/5

Safe to merge — the benchmark job is correctly gated to PR events only, all regression-guard logic paths now include dev builds, and the ci-pipeline check handles the conditionally-skipped job without false failures on push events.

Both previously identified silent failures in the regression guard are addressed: dev can now be selected as the latest candidate in findLatestPair, and the import-resolution section uses the same findLatestPair path as all other sub-benchmarks. The workflow wiring and skip semantics on non-PR events are correct.

No files require special attention.

Important Files Changed

Filename	Overview
.github/workflows/ci.yml	Adds `pre-publish-benchmark` job with correct conditional (`pull_request` only), parallel execution, and `ci-pipeline` wiring; skipped-job semantics on push events are safe because `ci-pipeline` only fails on 'failure'/'cancelled', not 'skipped'.
tests/benchmarks/regression-guard.test.ts	Fixes `isDevLatest` gap-check bypass in `findLatestPair`, migrates import-resolution to use `findLatestPair`, propagates `baselineVersion` everywhere so KNOWN_REGRESSIONS entries are correctly honoured for `dev` builds; adds `3.10.0:fnDeps depth 1` exemption.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([PR opened / push event]) --> B{event_name?}
    B -- pull_request --> C[native-host-build]
    B -- push to main --> D[lint / test / typecheck\naudit / verify-imports\nrust-check / parity]

    C --> E[pre-publish-benchmark\nif: pull_request]
    C --> D

    E --> E1[Download native-host-ubuntu-latest]
    E1 --> E2[npm install + build TS]
    E2 --> E3[Run build benchmark\n→ benchmark-result.json]
    E3 --> E4[Run resolution benchmark\n→ resolution-result.json]
    E4 --> E5[Gate on resolution thresholds\nvitest verbose]
    E5 --> E6[Run tracer validation]
    E6 --> E7[Merge resolution into build result]
    E7 --> E8[Run query benchmark]
    E8 --> E9[Run incremental benchmark]
    E9 --> E10[Update build / query / incremental reports]
    E10 --> E11[Regression guard\nRUN_REGRESSION_GUARD=1]
    E11 --> E12[Upload JSON artifacts\nif: always]

    D --> F[ci-pipeline\nif: always]
    E12 --> F
    B -- push to main\nskips benchmark --> F

    F --> G{any failure\nor cancelled?}
    G -- yes --> H([❌ PR blocked])
    G -- no --> I([✅ PR gates pass])

_{Reviews (13): Last reviewed commit: "Merge branch 'main' into ci/regression-g..." | Re-trigger Greptile}

greptile-apps · 2026-05-06T05:11:19Z

+      - name: Run build benchmark
+        run: |
+          STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
+          node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/benchmark.ts --version dev --dist > benchmark-result.json
+
+      - name: Run resolution benchmark
+        run: |
+          STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
+          node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/resolution-benchmark.ts --version dev --dist > resolution-result.json
+
+      - name: Gate on resolution thresholds
+        timeout-minutes: 30
+        run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose
+
+      - name: Run tracer validation (same-file edge recall)
+        timeout-minutes: 10
+        run: npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts --reporter=verbose
+
+      - name: Merge resolution into build result
+        run: |
+          node -e "
+            const fs = require('fs');
+            const build = JSON.parse(fs.readFileSync('benchmark-result.json', 'utf8'));
+            const resolution = JSON.parse(fs.readFileSync('resolution-result.json', 'utf8'));
+            build.resolution = resolution;
+            fs.writeFileSync('benchmark-result.json', JSON.stringify(build, null, 2));
+          "
+
+      - name: Run query benchmark
+        run: |
+          STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
+          node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/query-benchmark.ts --version dev --dist > query-benchmark-result.json
+
+      - name: Run incremental benchmark
+        run: |
+          STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
+          node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/incremental-benchmark.ts --version dev --dist > incremental-benchmark-result.json


Missing timeouts on benchmark execution steps

The four benchmark run steps (Run build benchmark, Run resolution benchmark, Run query benchmark, Run incremental benchmark) have no timeout-minutes, unlike Gate on resolution thresholds (30 min) and Run tracer validation (10 min). If any benchmark script hangs — for example due to an unresponsive subprocess or an infinite loop in new code — the job will occupy the runner for up to 6 hours before GitHub kills it, blocking the ci-pipeline gate for the entire duration. The equivalent steps in publish.yml share the same gap. Adding per-step timeouts (e.g., 15–20 min each) keeps a stuck job from becoming a multi-hour blocker.

Fixed in 9aca6db — added timeout-minutes: 20 to all four benchmark execution steps (build / resolution / query / incremental) so a hung script can't hold the runner for the GitHub default 6 hours. Same caveat carries over to publish.yml's equivalent steps; left untouched here per one-PR-one-concern (will follow up if needed).

…1072) The benchmark regression guard fails on the same root cause as the existing 3.9.6 known regressions ('Build ms/file', 'No-op rebuild'): PR #1016 expanded AST_TYPE_MAPS from 3 to 23 languages, causing zero-AST-row files to return astNodes: undefined and trigger a full-corpus re-parse. The regression shows up in INCREMENTAL-BENCHMARKS under the 'Full build' label as well, which was missed when PR #1038 added the other 3.9.6 entries. Both engines regressed: native 2148 -> 2986 (+39%), wasm 7563 -> 14036 (+86%). Fixed by PR #1038; reclears with v3.9.7+ benchmark data. Refs #1036, #1037, #1038.

Address Greptile review feedback: - Add 'if: github.event_name == "pull_request"' to skip the gate on push-to-main. The merged PR already passed the gate on this same diff, so re-running on the merge commit doubles CI minutes per landed change with no new signal. Mirrors publish.yml's 'if: github.event_name != "push"' skip. - Add per-step 'timeout-minutes: 20' to the four benchmark execution steps (build/resolution/query/incremental). Without explicit timeouts, a hung script would hold the runner for the GitHub default (6 hours), blocking the ci-pipeline gate for the entire duration.

carlos-alm · 2026-05-06T18:29:38Z

Addressed Greptile review feedback in 9aca6db and fda3c4f:

Event filter (if: github.event_name == 'pull_request') — added to scope the gate to PRs only, mirroring publish.yml's if: github.event_name != 'push' skip. Push-to-main no longer re-runs the full benchmark suite on the merge commit (the merging PR already gated on the same diff).
Per-step timeout-minutes — added 20-minute timeouts to the four benchmark execution steps (build / resolution / query / incremental) so a hung script can't hold the runner for the GitHub default 6 hours.

Also fixed the failing CI:

The pre-publish-benchmark job was tripping the regression guard on incremental Full build (3.9.6 vs 3.9.4), but the underlying regression is the same perf: WASM full-build regression in 3.9.6 (7.6s -> 14.0s, ~85% slower) #1036 / perf: 3.9.6 broad build regression — native +81% build, +60% query; both engines impacted #1037 root cause already documented in KNOWN_REGRESSIONS for Build ms/file and No-op rebuild. PR perf(wasm): scope ensureWasmTrees re-parse to files that need it #1038 added those two labels but missed the Full build label that surfaces in INCREMENTAL-BENCHMARKS — added it (fda3c4f) so the gate stops blocking on a regression that's already fixed and waiting to clear with v3.9.7+ data.

parseSemver('dev') returns null, so effectiveGap('dev', anyRelease) is Infinity — the > MAX_VERSION_GAP check silently rejected every dev → release pairing and findLatestPair fell through to compare the two most recent real releases. The per-PR regression gate was running on static historical data instead of the PR's own dev numbers. Bypass the gap check when latestVersion === 'dev' so dev is always compared against the most recent comparable release. Real releases still respect MAX_VERSION_GAP to avoid stale baselines.

carlos-alm · 2026-05-08T01:46:38Z

Addressed Greptile's outstanding concern about effectiveGap("dev", anyRelease) returning Infinity in dbd48ae:

The bug was real — parseSemver("dev") returns null, so effectiveGap("dev", anyRelease) always exceeded MAX_VERSION_GAP, causing findLatestPair to silently fall through to compare the two most recent real releases instead of dev vs the latest release. That defeated the entire purpose of running the regression guard on every PR.

Fix: in findLatestPair, when latestVersion === 'dev', bypass the version-gap check. Dev is always the current PR build, so the most recent comparable release is the correct baseline regardless of feature-expansion distance. Real releases still respect MAX_VERSION_GAP to avoid stale baselines. All 17 regression-guard tests pass.

carlos-alm · 2026-05-08T01:47:05Z

@greptileai

… is dev (#1072) KNOWN_REGRESSIONS keys are anchored to the release where the regression was first observed (e.g. '3.9.6:No-op rebuild'). When the per-PR gate runs 'dev' as latest, lookups using 'dev:Foo' never match — defeating the exemption mechanism for every documented regression. Fall back to the baseline (previous) version's key when latest is 'dev', so a single '3.9.6:Foo' entry covers both '3.9.6 vs 3.9.5' (release-time) and 'dev vs 3.9.6' (per-PR) until the next release clears the regression and the entry is pruned by the existing stale-entry test.

The native 1-file rebuild regressed from 78ms to ~116ms (build) and 54ms to ~81ms (incremental) when #1069 made backfillNativeDroppedFiles run on every successful orchestrator pass — including incrementals — to repair file_hashes/nodes rows for unsupported-extension files. #1070 fixed the orchestrator side, but the JS-side call stayed unconditional, wasting ~45ms per incremental on the codegraph corpus. Fix is tracked in PR #1082, which gates the backfill call on `isFullBuild || removedCount > 0`. Adding the known-regression entry lets this PR's gate pass while #1082 ships through review; the existing stale-entry test will warn once 3.9.7 lands and the entry stays past its useful life.

carlos-alm · 2026-05-08T09:07:33Z

Pushed two follow-up commits to fix the failing CI on this PR's own gate:

3cf32b1 - fix(bench): look up KNOWN_REGRESSIONS by baseline version when latest is dev. The exemption keys are anchored to the release where the regression was first observed (e.g. 3.9.6:No-op rebuild); when the per-PR gate runs dev as latest, lookups using dev:Foo never matched, so every documented regression resurfaced. Added a baseline-version fallback so a single 3.9.6:Foo entry now covers both 3.9.6 vs 3.9.5 (release-time) and dev vs 3.9.6 (per-PR) until the next release clears the regression.
46492cf - test(bench): mark 3.9.6 1-file rebuild as known regression. After the fallback fix, only 1-file rebuild remained failing (78 → 116ms / 54 → 81ms). Root cause is fix(native): persist file_hashes for dropped/symbol-less files #1069's unconditional backfill on every incremental, fix tracked in perf(native): skip backfill on clean incrementals #1082; added the entry with the same documentation pattern as the other 3.9.6:* exemptions.

Pre-publish benchmark gate is now passing.

The resolveEntries filter excluded 'dev' from the candidate pool, so nativeBatchMs / jsFallbackMs regressions introduced by a PR would silently pass the gate. Switch to findLatestPair (which already pairs dev → most-recent comparable release) so the import-resolution sub-benchmark gets the same per-PR coverage as the rest of the gate.

carlos-alm · 2026-05-12T02:08:09Z

Addressed Greptile's outstanding concern from the May 6 review summary about the import-resolution sub-benchmark in f5f8372:

tests/benchmarks/regression-guard.test.ts — the resolveEntries filter at line 572 excludes dev entries, leaving the import-resolution sub-benchmark without PR-level coverage.

Fix: switched the resolve sub-benchmark from the dev-excluding resolveEntries filter to findLatestPair (which already pairs dev against the most recent comparable release for the engine sub-benchmarks). nativeBatchMs / jsFallbackMs regressions introduced by a PR are now caught by the same gate as build / query / incremental.

Also resolved the merge conflict with main (9194394). The conflict was in regression-guard.test.ts — main's NOISY_METRICS / thresholdFor additions co-exist cleanly with this PR's isDevLatest bypass and baselineVersion fallback. Pruned the stale 3.9.6:Full build / 3.9.6:1-file rebuild entries since main's 3.10.0:* entries now cover those metrics for dev comparisons.

carlos-alm · 2026-05-12T02:08:22Z

@greptileai

CI run 25708925467 surfaced fnDeps depth 1 at 24.7 → 44.9ms (+82%) on the per-PR gate, above the 50% NOISY_METRIC_THRESHOLD. The fn_deps Rust implementation and JS wrapper are byte-for-byte unchanged since v3.9.6 (already documented in the NOISY_METRICS docstring); the ~20ms absolute delta is the shared-runner noise floor on a sub-30ms baseline. Same pattern as the existing 3.10.0:No-op rebuild / 3.10.0:1-file rebuild entries; remove once 3.11.0+ data confirms stabilization under the warmup + 5-sample methodology.

greptile-apps Bot reviewed May 6, 2026

View reviewed changes

carlos-alm and others added 3 commits May 6, 2026 00:05

Merge branch 'main' into ci/regression-gate-on-pr

054b78d

carlos-alm and others added 2 commits May 6, 2026 16:52

Merge branch 'main' into ci/regression-gate-on-pr

e8614e6

carlos-alm and others added 4 commits May 7, 2026 20:00

Merge branch 'main' into ci/regression-gate-on-pr

dcee140

Merge remote-tracking branch 'origin/main' into ci/regression-gate-on-pr

4371bd2

carlos-alm and others added 4 commits May 8, 2026 03:18

Merge remote-tracking branch 'origin/main' into ci/regression-gate-on-pr

7cbe93e

Merge branch 'main' into ci/regression-gate-on-pr

cf7ca99

fix: resolve merge conflicts with main

9194394

carlos-alm and others added 2 commits May 11, 2026 20:26

Merge branch 'main' into ci/regression-gate-on-pr

c040b10

carlos-alm merged commit f6c0289 into main May 12, 2026
21 checks passed

carlos-alm deleted the ci/regression-gate-on-pr branch May 12, 2026 03:24

github-actions Bot locked and limited conversation to collaborators May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: run pre-publish benchmark gate on every PR#1072

ci: run pre-publish benchmark gate on every PR#1072
carlos-alm merged 16 commits into
mainfrom
ci/regression-gate-on-pr

carlos-alm commented May 6, 2026

Uh oh!

claude Bot commented May 6, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 6, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 6, 2026

Uh oh!

carlos-alm May 6, 2026

Uh oh!

carlos-alm commented May 6, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 12, 2026

Uh oh!

carlos-alm commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented May 6, 2026

Summary

Why

How it differs from publish.yml's gate

Test plan

Uh oh!

claude Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR: ci: run pre-publish benchmark gate on every PR

Root Cause Analysis

Backlog Compliance

Critical Concerns

Strengths:

Technical Implementation Quality:

Minor Considerations:

Final Recommendation

Uh oh!

greptile-apps Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm May 6, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm commented May 6, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 8, 2026

Uh oh!

carlos-alm commented May 12, 2026

Uh oh!

carlos-alm commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 6, 2026 •

edited

Loading

greptile-apps Bot commented May 6, 2026 •

edited

Loading